Prioritized experience replay was proposed by T. Schaul et. al., and is widely used to speed up reinforcement learning (as far as I know).

Roughly speaking, mis-predicted observations will be learned more frequently. To compensate distorted probability, weight of learning is scaled to the opposite direction (cf. importance sampling).

In cpprb, `PrioritizedReplayBuffer`

class implements these functionalities with proportional base (instead of rank base) priorities as shown at bellow table.

\(i\)-th | Definition |
---|---|

Probability: \( P(i) \) | \( \frac{(P_{i}+\epsilon)^{\alpha}}{\sum _{j=0}^{N} (P_{j}+\epsilon)^{\alpha}} \) |

Weight: \( w_i \) | \( \frac{\left( \frac{1}{N} \frac{1}{P(i)}\right) ^{\beta}}{\left( \frac{1}{N} \frac{1}{\min _{j} P(j)}\right) ^{\beta}} \to \left(\frac{\min _{j} P(j)}{P(i)}\right) ^{\beta} \) |

You can `add`

`priorities`

together with other environment. If no `priorities`

are passed, the stored maximum priority is used.

The `dict`

returned by `sample`

also has special key-values of `indexes`

and `weights`

. The `indexes`

are intended to be passed to `update_priorities`

to update their priorities after comparison with new prediction. The `weights`

can be used to scale errors when updating network parameters. Since the `weights`

are normalized by the largest weight, as the original research did, the values are always less than or equal to 1.

`PrioritizedReplayBuffer`

has hyperparameters `alpha`

(\( \alpha \)) and `eps`

(\( \epsilon \)) at constructor and `beta`

(\( \beta \)) at `sample`

method. Their default values are `0.6`

, `1e-4`

, and `0.4`

, respectively. The detail is described in the original paper above.

Parameters | Default | Description |
---|---|---|

`alpha` (\( \alpha \)) |
\( 0.6 \) | Prioritized parameter. `0` means uniform sampling. |

`beta` (\( \beta \)) |
\( 0.4 \) | Compensation parameter. `1` means fully compensate (i.e. Importance Sampling) |

`eps` (\( \epsilon \)) |
\( 10^{-4} \) | Small value to avoid `0` priority. |

```
import numpy as np
from cpprb import PrioritizedReplayBuffer
buffer_size = 256
prb = PrioritizedReplayBuffer(buffer_size,
{"obs": {"shape": (4,4)},
"act": {"shape": 3},
"rew": {},
"next_obs": {"shape": (4,4)},
"done": {}},
alpha=0.5)
for i in range(1000):
prb.add(obs=np.zeros((4,4)),
act=np.ones(3),
rew=0.5,
next_obs=np.zeros((4,4)),
done=0)
batch_size = 32
s = prb.sample(batch_size,beta=0.5)
indexes = s["indexes"]
weights = s["weights"]
# train
# ...
new_priorities = np.ones_like(weights)
prb.update_priorities(indexes,new_priorities)
```

The constructor of `PrioritizedReplayBuffer`

(aka. `__init__`

) has parameter `check_for_update`

. If the parameter is `True`

(default is `False`

), the buffer traces updated indices after the last calling of `sample()`

method and the `update_priorities()`

method updates priorities only at unchanged indices. This feature is designed for multiprocess learning in order to avoid mis-updating priorities of already overwritten values. (This protection is always enabled at `MPPrioritizedReplayBuffer`

)

To choose prioritized sample efficiently, partial summation and minimum of pre-calculated weights are stored in Segment Tree data structure, which is written by C++ and which was an initial main motivation of this project.

To support multiprocessing, the Segment Tree can be lazily updated, too. (`MPPrioritizedReplayBuffer`

)

Minibatch sampling is done by stratified sampling, which samples from every equal probability space, resulting smaller distributions among minibatches.